1. INTRODUCTION

Countries around the world have prepared themselves for open war, and defense planning has become their
focus in dealing with conventional warfare [1]. In conventional warfare, especially in conflict areas, soldiers
need to detect the presence of humans around them and to classify whether the humans detected are
combatants, non-combatants, civilians, enemies, or friends. At present, as an anticipatory step for regional
security and defense, humans can be identified and classified only under certain conditions by using the
human sense of sight (eyes).

In recent years, digital military applications have become increasingly widespread, both in communication and
detection [2]-[4]. Researchers have approached the problem of military object detection in various ways,
including hyperspectral imaging and remote sensing [5], distributed sensors [6], [7], and real-time video
surveillance [8], [9]. Deep learning for military object detection using video surveillance is also used in weapons
installation [10].

Detection and recognition systems have grown rapidly since the introduction of the triplet loss algorithm
by Schroff et al. [11], which achieved an accuracy of 99.63% in facial recognition on
the Labeled Faces in the Wild (LFW) dataset. Triplet loss takes as input a triplet of images consisting of
an anchor image (xa), a positive image (xp), and a negative image (xn). The distance between the anchor and
the positive is minimized because they have the same identity, while the distance between the anchor and
the negative is maximized because they have different identities [11]. The key to the success of triplet loss is
the triplet selection method, which has also been applied in different applications [12]-[16]. Alongside
these developments, however, several studies have emerged as corrections and refinements of triplet
loss. A major caveat of triplet loss is that as the dataset gets larger, the possible number of triplets
grows cubically, rendering sufficiently long training impractical [17]. FaceNet picked a random negative for
every pair of anchor and positive, which was very time-consuming [18]. The method in [19] imposes an
additional constraint on the traditional triplet loss, which limits the distances of positive pairs to be smaller
than a pre-defined value.
This study analyzed how effectively an application program combined with a camera can
identify and classify the presence of humans, thus replacing the function of the eye. It introduces a
new loss function called electrostatic loss, a modification of the triplet loss algorithm based on the
electrostatic force between charged particles as described by Coulomb's law. The shortcomings of
triplet loss addressed in this study are as follows: triplet loss moves (xn) away from (xa) but does not
necessarily move (xn) away from (xp) [18]; it does not specify how close (xa) and (xp) should be, so clusters
with a large intra-class distance may form [17]; and it does not determine the magnitude and direction of
the displacement of (xn), so convergence is more difficult to achieve [20]. Therefore, a new electrostatic loss
algorithm is proposed to give better results.


2. METHOD 
2.1. Related work 

A loss function measures how well a model, here one trained with electrostatic loss, predicts the
target [21]. It takes a nonzero value whenever the model makes an error. The lower the error that the loss
function produces, the better the model can be considered to function.


2.1.1. History of loss function development 

Loss function development, as shown in Figure 1, began with studies on face recognition (FR) using
deep FR, namely DeepFace [22] and DeepID [23]. After that, Euclidean-distance-based losses known as
contrastive loss, triplet loss, and center loss were developed. In 2016 and 2017, L-softmax [24] and A-softmax
[25] were designed to promote the development of large-margin feature learning. In 2017, the normalization
of features and weights also began to show good performance, which led to research on softmax variations.
In Figure 1, the sections colored red, green, blue, and yellow represent deep methods based on softmax,
Euclidean-distance-based loss, angular/cosine-based loss, and variations of softmax, respectively.
The softmax loss function is commonly used as a reference in object recognition, but it is
not effective enough for FR because the intra-class variation can be larger than the inter-class difference, and
more detailed features are required when recognizing different people [26]. Previous research has therefore
focused heavily on creating new loss functions that make features more separable and detailed.

Figure 1. History of loss function development [26]


2.1.2. Triplet loss 
Triplet loss is a loss function used in the FR process. It is applied in OpenFace as a component of the
stochastic gradient descent process during training [27]. It was first proposed by Schroff et al. [11] in the
FaceNet paper, which reported an accuracy of 99.63% for FR on the LFW dataset. Figure 2 illustrates the
mechanism of triplet loss.

Figure 2. Triplet loss minimizes the distance between anchor and positive (both have the same identity information)
and maximizes the distance between anchor and negative (both have different identity information) [11]


Triplet loss takes as input a triplet of images consisting of an anchor image, a positive image, and a
negative image. In notational form, it can be written as (xa, xp, xn), in which xa is the anchor image, xp is the
positive image, and xn is the negative image. Triplet loss aims to make the distance between the anchor image and
the positive image smaller than the distance between the anchor image and the negative image [11]. Formally,
triplet loss is given in (1)-(3):

$$\|f(x_i^a) - f(x_i^p)\|_2^2 + \alpha < \|f(x_i^a) - f(x_i^n)\|_2^2 \quad (1)$$

$$\forall \left(f(x_i^a), f(x_i^p), f(x_i^n)\right) \in \mathcal{T} \quad (2)$$

$$L = \sum_{i}^{N} \left[ \|f(x_i^a) - f(x_i^p)\|_2^2 - \|f(x_i^a) - f(x_i^n)\|_2^2 + \alpha \right]_+ \quad (3)$$

where $[z]_+ = \max(z, 0)$, and $f(x_i^a)$, $f(x_i^p)$, and $f(x_i^n)$ are the features of the three input
images, which are generally normalized during training.

In addition, triplet loss has a parameter α, the margin between the positive and negative pairs. To measure
the similarity of the extracted features of two images, triplet loss uses the Euclidean distance. In practice, the
parameter is set to 0.2, which separates the anchor-positive pair from the anchor-negative pair by a sufficiently
large distance and may result in good performance in the training process. Meanwhile, triplets
whose values fall outside the margin are ignored because the training process may otherwise fail. Thus, not all
available triplets can be used in triplet loss. Here T is the set of all possible triplets of images in
the training set, with cardinality N. To summarize, the triplet loss value in (3) can be simplified as (4):


$$L = \max\left(d(a,p) - d(a,n) + \text{margin},\ 0\right) \quad (4)$$


The symbol a is the anchor image, p is the positive image, and n is the negative image, while d is
the distance in the embedding space. From (4), it can be seen that triplet loss minimizes the loss by making
the value of d(a,p) close to 0 and the value of d(a,n) greater than d(a,p) plus the margin.
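
To make (4) concrete, the following minimal sketch evaluates the simplified triplet loss for a single triplet. It assumes d is the plain Euclidean distance and uses the margin of 0.2 mentioned above; the function names and the toy 2-dimensional embeddings are illustrative assumptions, not part of the original implementation.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Equation (4): max(d(a,p) - d(a,n) + margin, 0) for one triplet."""
    d_ap = euclidean(anchor, positive)
    d_an = euclidean(anchor, negative)
    return max(d_ap - d_an + margin, 0.0)

# Toy 2-D embeddings (hypothetical values, for illustration only).
a, p = [0.0, 0.0], [0.0, 0.1]
easy_n = [1.0, 0.0]   # far from the anchor: the margin is already satisfied
hard_n = [0.0, 0.2]   # close to the anchor: the margin is violated

print(triplet_loss(a, p, easy_n))  # zero loss, triplet contributes nothing
print(triplet_loss(a, p, hard_n))  # positive loss, triplet drives learning
```

The first triplet already satisfies the margin and yields zero loss, which illustrates why only a subset of the available triplets contributes to training.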


2.2. Overall framework 

This study was designed for end-to-end military detection and recognition based on deep metric
learning and the convolutional neural network (CNN), using a loss function named electrostatic
loss. The overall framework is divided into two parts: the training and testing process, and the
production process. As shown in Figure 3, the training and testing process begins with converting several
combatant, non-combatant, and civilian videos into image sequences. The image sequences were then annotated to
determine the position of combatants, non-combatants, and civilians. As the annotated regions were still of
various sizes, they were resized to uniform dimensions matching the input required by the
architecture. Then, each image was processed with the siamese network, a type of neural network architecture
that accepts two or more inputs. These inputs are processed by the same subnetwork and then combined
to calculate the similarity between the inputs. The output of the siamese network in this study is an
N-dimensional vector. Electrostatic loss was then used as the objective function to determine the error of the
siamese network's output. The result of the training is the set of weights of the network architecture, which can
later be used for classification into combatants, non-combatants, and civilians.

Figure 3. Training and testing process


As shown in Figure 4, the production process starts from the image produced by the military camera,
which is then processed by MobileNet-SSD as the person detector. The resulting image is then
resized according to the input requirements of the network architecture used. Next, the image is
processed by the network architecture loaded with the weights from the previous training process with
electrostatic loss. The output of this process is the embedding vector of the image, which is compared with the
center point of the embedding cluster of each class by calculating the distance using the cosine similarity method.
The distance between the vectors shows how similar the image is to combatants, non-combatants, or
civilians. The closer the image vector is to a class center, the more similar it is to that class. If the input
image vector is far from every reference vector, it is classified as unknown.
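
The comparison step can be sketched as follows. This is a minimal illustration, assuming that each class is represented by a single centroid vector and that an input is labeled unknown when its best cosine similarity falls below a threshold; the threshold value of 0.8, the function names, and the 3-dimensional toy vectors are assumptions for illustration and are not taken from the study.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify(embedding, centroids, threshold=0.8):
    """Return the label of the most similar centroid, or 'unknown'
    when even the best similarity is below the (assumed) threshold."""
    best_label, best_sim = "unknown", -1.0
    for label, center in centroids.items():
        sim = cosine_similarity(embedding, center)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label if best_sim >= threshold else "unknown"

# Hypothetical class centroids in a toy 3-D embedding space.
centroids = {
    "combatant": [1.0, 0.0, 0.0],
    "non-combatant": [0.0, 1.0, 0.0],
    "civilian": [0.0, 0.0, 1.0],
}

print(classify([0.9, 0.1, 0.0], centroids))  # close to the combatant centroid
print(classify([1.0, 1.0, 1.0], centroids))  # equally far from all centroids
```

In practice the centroids would be the mean embedding of each class computed from the training set, and the threshold would need to be tuned on validation data.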

Figure 4. Production process


2.3. Object detection and classification
2.3.1. Object annotation and dataset 

One of the main requirements for building a system/model design with machine learning is a dataset.
Due to the scarcity of military datasets, the authors of this study created their own. First, several videos
on YouTube that showed the presence of combatants, non-combatants, and civilians were collected. These
videos were then converted into image sequences, which were annotated with bounding boxes, positions,
and class labels for combatants, non-combatants, and civilians. The annotations yielded a dataset in the Pascal
Visual Object Classes (VOC) format. Figure 5 shows the object annotation process.
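
As an illustration of what such an annotation looks like, the sketch below parses a small Pascal VOC style XML record with the Python standard library. The file name, class labels, and coordinates in the sample are made-up values, and `parse_voc` is a hypothetical helper name; only the element names (`object`, `name`, `bndbox`, `xmin`, `ymin`, `xmax`, `ymax`) follow the VOC annotation layout.

```python
import xml.etree.ElementTree as ET

# A made-up annotation in Pascal VOC layout (values are illustrative only).
SAMPLE_XML = """<annotation>
  <filename>frame_0001.jpg</filename>
  <size><width>1280</width><height>720</height><depth>3</depth></size>
  <object>
    <name>combatant</name>
    <bndbox><xmin>100</xmin><ymin>50</ymin><xmax>300</xmax><ymax>400</ymax></bndbox>
  </object>
  <object>
    <name>civilian</name>
    <bndbox><xmin>500</xmin><ymin>80</ymin><xmax>640</xmax><ymax>420</ymax></bndbox>
  </object>
</annotation>"""

def parse_voc(xml_text):
    """Return (filename, [(label, (xmin, ymin, xmax, ymax)), ...])."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        coords = tuple(int(box.find(tag).text)
                       for tag in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((obj.find("name").text, coords))
    return root.find("filename").text, boxes

filename, boxes = parse_voc(SAMPLE_XML)
print(filename)
for label, coords in boxes:
    print(label, coords)
```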


2.3.2. Person detection 

As Figure 4 shows, person detection is the initial stage that provides the input to the classification
process (combatants, non-combatants, or civilians). In this study, the MobileNet-SSD network architecture was
used to detect humans in the prepared video frame. MobileNet is designed to be efficient for mobile applications
and embedded computer vision [28]. In several studies, the MobileNet-based architecture for object detection
reached 4.5 FPS when running on a Raspberry Pi [28].

This detection process gives the position of the bounding box of the person within the video
frame (x1, x2, y1, y2). The detected objects are then sent to the object classification process. Figure 6 shows a
diagram of the MobileNet-SSD network architecture.

Figure 5. Object annotation process in labelimage [29]

Figure 6. MobileNet-SSD network architecture [30]


2.3.3. Network architecture 

In this study, the Inception ResNet-v2 network architecture was used to train and validate the 
system/model design. Inception was first introduced by Szegedy et al. [31] in their research on the development 
of the CNN in 2014. Recently, very deep convolutional networks have become the main development of image 
recognition. With its relatively low computation, inception can produce excellent performance. 

Several versions of the Inception network exist: Inception-v1 in 2014 [32], Inception-v2 and Inception-v3
in 2015 [31], and Inception-v4, Inception-ResNet-v1, and Inception-ResNet-v2 in
2016 [33]. Each of these versions, developed by Szegedy et al. [33], improved on the previous
one. In this study, two network architectures, namely Inception-ResNet-v2 and Inception-v4, were
compared and used in the next process. The output of both Inception-ResNet-v2 and Inception-v4 is a
vector over 1,000 classes. Therefore, in this study, a fully connected layer of size N was added after the last
layer to obtain an N-dimensional embedding vector, as shown in Figure 7. To train the Inception-ResNet-v2
design, a new loss function, electrostatic loss, was proposed.

Figure 7. Inception-ResNet-v2 with additional fully connected layer


2.3.4. Embedding dimensionality

In this study, experiments were conducted using several embedding dimensionalities, i.e., 32, 64, and
128, derived from the 1,000-class output of two network architectures, namely Inception-ResNet-v2 and
Inception-v4. From the embedding dimensionality and network architecture, we could then compare the results of
the 32-dimensional Inception-ResNet-v2, 64-dimensional Inception-ResNet-v2, 128-dimensional
Inception-ResNet-v2, and 128-dimensional Inception-v4.


2.4. Electrostatic loss 

The triplet loss method proposed by Schroff et al. [11] is the basis of the proposed system.
Triplet loss produced an accuracy of 99.63% in FR on the LFW dataset. However, the algorithm has the
following shortcomings:

- The process of moving f(xn) away from f(xa) does not necessarily move it away from f(xp), possibly causing
the vector distance between f(xp) and f(xn) to not be greater than the vector distance between f(xp) and
f(xa) [18].

- When an error occurs in choosing an image pair, the error calculation will always be the same and be
repeated many times, so the network takes longer to converge. Therefore, triplet loss applies hard triplet
and semi-hard triplet mining [11], [14] for pair selection [17].

- The alpha parameter (α) is a constant. However, it is quite difficult to determine a good
value for a given case. Thus, the parameter value is sometimes selected based on the researcher's
intuition [18], [34].

Based on these findings, electrostatic loss draws on the physics of charged particles, namely the electrostatic
force. In this study, the attractive and repulsive forces between charged particles are used as an analytical
approach for moving the vector elements of the anchor image (f(xa)), positive image
(f(xp)), and negative image (f(xn)) toward or away from each other. The electrostatic force is stated in (5).


$$F = k \frac{q_1 q_2}{r^2} \quad (5)$$

where:
F = electrostatic force (N)
k = Coulomb's constant (N·m²/C²)
q1 = magnitude of charge 1 (C)
q2 = magnitude of charge 2 (C)
r = distance between charges (m)


According to Coulomb's law, particles with the same type of charge experience a repulsive
force, and particles with different types of charge experience an attractive force. In
electrostatic loss, the attractive and repulsive forces are adjusted to suit the conditions of
the vector elements of the anchor image (f(xa)), positive image (f(xp)), and negative image (f(xn)). In this study,
we reversed the behavior so that images with the same charge attract and images with different charges repel,
as presented in Table 1. The force vectors themselves, however, are still based on Coulomb's law.


Table 1. Adjustment of the rules between Coulomb's law and electrostatic loss for images treated as
charged particles

No.   Charged particles (images)   Coulomb's law   Electrostatic loss
1.    Anchor + positive            Repulsive       Attractive
2.    Anchor + negative            Attractive      Repulsive
3.    Positive + negative          Attractive      Repulsive

where the anchor and the positive image are treated as positive charges and the negative image is treated as a
negative charge.


Figures 8 and 9 illustrate the analysis of the forces that occur in triplet loss and electrostatic loss. The
forces affect the direction and strength of attraction or repulsion between elements of the image vectors.
In electrostatic loss, f(xn) moves away from both f(xa) and f(xp), while f(xa), f(xp), and f(xn) are treated as
charged particles with charge q ∈ (0,1).

Figure 8. Illustration of force analysis in triplet loss (F triplet loss)

Figure 9. Illustration of force analysis in electrostatic loss (F electrostatic loss)


The value of q is obtained from one of the output values of the 128-dimensional vectors. The magnitude of
the charge q determines the magnitude of the attractive and repulsive forces between charges. From this concept,
the value of electrostatic loss can be calculated as in (6)-(8):

$$L_{Electrostatic} = \sum_{a,p,n}^{N} \left( L_{ap} + L_{an} + L_{pn} \right) \quad (6)$$

$$= \sum_{a,p,n}^{N} \left[ d(a,p) - d(a,p') \right]_+ + \left[ d(a,n) - d(a,n') \right]_+ + \left[ d(p,n) - d(p,n') \right]_+ \quad (7)$$

$$= \sum_{a,p,n}^{N} \left[ \|f(x^a) - f(x^p)\|_2^2 - \|f(x^a) - f(x^{p'})\|_2^2 \right]_+
+ \left[ \|f(x^a) - f(x^n)\|_2^2 - \|f(x^a) - f(x^{n'})\|_2^2 \right]_+
+ \left[ \|f(x^p) - f(x^n)\|_2^2 - \|f(x^p) - f(x^{n'})\|_2^2 \right]_+ \quad (8)$$

where $[z]_+ = \max(z, 0)$, and $f(x^a)$, $f(x^p)$, and $f(x^n)$ are the N-dimensional vectors
of the anchor, positive, and negative images. Meanwhile, $f(x^{p'})$ and $f(x^{n'})$ are the vectors after
displacement, influenced by electrostatic forces over a certain time unit. Based on the electrostatic force shown
in (5), the forces experienced by the anchor, positive, and negative images can be
formulated as in (9)-(11).


$$F_{ap} = F_{pa} = k \frac{q_a q_p}{r_{ap}^2} \quad (9)$$

$$F_{an} = F_{na} = k \frac{q_a q_n}{r_{an}^2} \quad (10)$$

$$F_{np} = F_{pn} = k \frac{q_n q_p}{r_{np}^2} \quad (11)$$


Here qa, qp, and qn are the values taken from the first index of the output embedding of the anchor, positive,
and negative images, respectively. Fap is the force that arises from vector a to p; Fan is the force that arises
from vector a to n; and Fnp is the force that arises from vector n to p.
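
The pairwise forces in (9)-(11) can be sketched as below. This is only an illustration under stated assumptions: the charge of each image is read from the first element of its embedding, as described above, while the constant k = 1.0, the use of the Euclidean distance between embeddings as r, and all function names and toy values are hypothetical choices that the text does not specify.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def coulomb_force(q1, q2, r, k=1.0):
    """Magnitude of the force in equation (5): k * q1 * q2 / r^2."""
    return k * q1 * q2 / (r ** 2)

def pairwise_forces(f_a, f_p, f_n, k=1.0):
    """Force magnitudes of equations (9)-(11) for one triplet of embeddings.
    The charge of each image is the first element of its embedding."""
    q_a, q_p, q_n = f_a[0], f_p[0], f_n[0]
    return {
        "ap": coulomb_force(q_a, q_p, euclidean(f_a, f_p), k),
        "an": coulomb_force(q_a, q_n, euclidean(f_a, f_n), k),
        "np": coulomb_force(q_n, q_p, euclidean(f_n, f_p), k),
    }

# Toy 2-D embeddings with charges in (0, 1) as their first element.
f_a, f_p, f_n = [0.5, 0.0], [0.5, 1.0], [0.25, 2.0]
forces = pairwise_forces(f_a, f_p, f_n)
for pair, magnitude in forces.items():
    print(pair, round(magnitude, 4))
```

Because the force decays with the square of the distance, pairs that are already far apart exert a much weaker force on each other than nearby pairs, which is what gives the displacement both a magnitude and a direction.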

The displacement is further governed by the velocity variable α, which
determines how far the charges q should approach (attract) or move away from each other (repulse) in a certain
time unit. If the charges approach each other (attract), the new position $x^{p'}$ can be formulated as in (12).

$$x^{p'} = F_{pa}\,\alpha\,(x^p + x^a) + F_{pn}\,\alpha\,(x^p + x^n) \quad (12)$$

If the charges move away from each other (repulse), the new position $x^{n'}$ can be calculated as in (13).

$$x^{n'} = F_{na}\,\alpha\,(x^n + x^a) + F_{np}\,\alpha\,(x^n + x^p) \quad (13)$$


2.5. Evaluation metrics of research results 

The last stage in this study is the evaluation of the research results. This stage measures
the performance of the system/model design that has been built. Various
methods can be used to measure the performance of a system/model design in research
similar to this one. In this study, the evaluation was carried out using several
methods, namely principal component analysis (PCA), accuracy, mean average precision (mAP), R-precision,
adjusted mutual information (AMI), and normalized mutual information (NMI).


2.5.1. Principal component analysis

The quantity of research data will affect the data analysis. Hence, a technique is needed to 
simplify/reduce the dimensions of research data. One tool used to reduce the dimensions of data without 
reducing its characteristics is PCA [35]. The decomposition of the eigenvalues and eigenvectors of the PCA 
covariance matrix will produce the principal components [36]. The PCA in this study is useful for the visual 
analysis of embedding vectors at 32, 64, and 128 dimensions. 


2.5.2. Accuracy 

Accuracy is the percentage of the test set tuples that are correctly classified by the classifier [37]. In
other words, the accuracy value is the ratio between the correctly classified data and the
whole data. In the classification process, accuracy is measured using (14) [37].

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \times 100\% \quad (14)$$

where:
TP = true positive, the number of positive data classified correctly as positive by the system
TN = true negative, the number of negative data classified correctly as negative by the system
FP = false positive, the number of negative data classified incorrectly as positive by the system
FN = false negative, the number of positive data classified incorrectly as negative by the system
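
As a quick worked example of (14), the sketch below computes accuracy from the four confusion-matrix counts. The counts are made-up numbers chosen only to illustrate the arithmetic, and the function name is an assumption.

```python
def accuracy(tp, tn, fp, fn):
    """Equation (14): share of correctly classified samples, in percent."""
    total = tp + tn + fp + fn
    return 100.0 * (tp + tn) / total

# Hypothetical counts: 95 of 100 samples are classified correctly.
print(accuracy(tp=45, tn=50, fp=2, fn=3))
```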


2.5.3. Mean average precision 

Average precision (AP) and mAP are the most popular metrics used to evaluate an object detection
system/model design. AP is calculated individually for each class, resulting in as many AP values as there are
classes. These AP scores are averaged to obtain the mAP across all classes. AP is the mean of the precision
values obtained at each relevant item, using a value of 0 for relevant items that are not returned by the
system. The precision values take into account the order of the items given by the system.
Equation (15) is used to calculate the value of mAP [38].

$$mAP = \frac{1}{Q} \sum_{q=1}^{Q} \frac{1}{m_q} \sum_{k=1}^{m_q} Precision(R_{qk}) \quad (15)$$

where:
Q = number of test queries
R = relevant items generated by the system
m = the number of relevant items generated from the query
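
The computation in (15) can be sketched as follows. This is a minimal illustration that assumes each query is given as a ranked list of relevance flags and that every relevant item appears somewhere in the ranking, so that m equals the number of relevant items retrieved; the function names and the two toy queries are assumptions.

```python
def average_precision(ranked_relevance):
    """AP for one query: mean of precision@k over the ranks k
    at which a relevant item appears."""
    hits, precisions = 0, []
    for k, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(queries):
    """Equation (15): mean of the per-query AP values."""
    return sum(average_precision(q) for q in queries) / len(queries)

queries = [
    [True, False, True, False],   # precisions 1/1 and 2/3, so AP = 5/6
    [False, True, False, False],  # precision 1/2, so AP = 1/2
]
print(mean_average_precision(queries))  # (5/6 + 1/2) / 2 = 2/3
```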


2.5.4. R-precision 

R-precision is one of the most frequently used parameters for measuring the accuracy of a system 
design/recognition model. Although it is empirically stated that R-precision is a good measure, the theoretical 
reasons are not clear [39]. Aslam and Yilmaz applied a simple geometric interpretation for their research and 
theoretically proved why R-precision is called a very informative parameter. The value of R-precision on recall 
r is shown in (16) [39]. 


$$p(r) = \frac{1 - r}{1 + \alpha r} \quad (16)$$

where $\alpha = (1/r_p - 1)^2 - 1$. This value of α ensures that the curve passes through the point (rp, rp).
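
The following sketch checks the property stated for (16) numerically and also shows how R-precision itself is commonly computed, namely as the precision at rank R, where R is the number of relevant items for the query. That definition is the standard one from information retrieval rather than something spelled out in this section, and the function names and sample values are illustrative assumptions.

```python
def alpha_from_r_precision(rp):
    """alpha = (1/rp - 1)^2 - 1, the parameter of the curve in (16)."""
    return (1.0 / rp - 1.0) ** 2 - 1.0

def precision_at_recall(r, rp):
    """Equation (16): p(r) = (1 - r) / (1 + alpha * r)."""
    return (1.0 - r) / (1.0 + alpha_from_r_precision(rp) * r)

def r_precision(ranked_relevance, num_relevant):
    """Precision at rank R, where R is the number of relevant items."""
    return sum(ranked_relevance[:num_relevant]) / num_relevant

# The curve passes through (0, 1), (rp, rp), and (1, 0).
rp = 0.8
print(precision_at_recall(0.0, rp))
print(precision_at_recall(rp, rp))
print(precision_at_recall(1.0, rp))

# Three relevant items, two of them ranked within the top three.
print(r_precision([True, True, False, True, False], num_relevant=3))
```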


2.5.5. Adjusted mutual information 

The AMI is a variation of mutual information (MI) used to compare clusterings. Specifically, the AMI has a value
equal to 1 when the two groupings are identical and 0 when the MI between the two groupings is equal to the
expected value [40]. The AMI is defined as in (17) [41]:

$$AMI(A,B) = \frac{NMI(A,B) - E\{NMI(A,B)\}}{1 - E\{NMI(A,B)\}} \quad (17)$$

where E{NMI(A,B)} is the expected (normalized) mutual information between A and B.


2.5.6. Normalized mutual information 

The NMI has been widely used as a reference metric for evaluating clustering models by measuring
the level of dependence between two variables or the level of similarity of information between them [42]. The
NMI value itself ranges from 0 to 1. If the NMI value is 0, then the two variables are independent; if the
value is 1, then the variables have the same content. The NMI can be calculated using (18) [43]:

$$NMI(Y,C) = \frac{2 \times I(Y;C)}{H(Y) + H(C)} \quad (18)$$

where:
Y = class labels
C = cluster labels
H(Y) = entropy value of Y
H(C) = entropy value of C
I(Y;C) = mutual information of Y and C
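
Equation (18) can be evaluated directly from two label lists, as in the minimal sketch below. It uses natural logarithms (the base cancels in the ratio) and, as an assumed convention, returns 1.0 when both labelings consist of a single group so that the denominator would be zero; the function names and the toy labels are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H of a label list (natural logarithm)."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(y, c):
    """Mutual information I(Y;C) between two label lists of equal length."""
    n = len(y)
    joint = Counter(zip(y, c))
    count_y, count_c = Counter(y), Counter(c)
    return sum((n_yc / n) * math.log(n * n_yc / (count_y[a] * count_c[b]))
               for (a, b), n_yc in joint.items())

def nmi(y, c):
    """Equation (18): 2 * I(Y;C) / (H(Y) + H(C))."""
    denom = entropy(y) + entropy(c)
    # Assumed convention: two trivial (single-group) labelings are identical.
    return 2.0 * mutual_information(y, c) / denom if denom > 0 else 1.0

classes = [0, 0, 1, 1]
print(nmi(classes, [1, 1, 0, 0]))  # same grouping with renamed clusters
print(nmi(classes, [0, 1, 0, 1]))  # clusters independent of the classes
```

The first call shows that NMI is invariant to how the clusters are named, which is why it suits the comparison of cluster labels against class labels.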


3. RESULT AND DISCUSSION 
3.1. Training, testing, and production processes 

In this study, the main equipment used was an Intel Core i7-4790 CPU @ 3.60 GHz (8 CPUs) with 32 GB
RAM, an NVIDIA GeForce GTX 1080 11 GB graphics card, and the Python 3.7 programming language. These
specifications reflect the equipment available to the authors and were sufficient for the research to run
as expected. The training, testing, and production processes in this study were conducted
through preprocessing, feature extraction, and classification.


3.1.1. Preprocessing 


The data used in a research process may not always be in ideal condition or ready to be processed.
Thus, preprocessing is used to improve the quality of the images to be processed for a better analysis.
Preprocessing can reduce the interference that exists in the data. In this study, the preprocessing of the training
and testing processes was done in three stages: video conversion to image sequences, object annotation, and
image resizing. The production process, in contrast, had two stages: person detection and image resizing.

a. Video to image sequence conversion stage

In the training and testing process as well as the production process, some videos obtained from 
YouTube or military cameras needed to be converted first into image sequences so that they can be processed 
further. The stages of conversion from video to image sequences were conducted using the video to image 
converter application. The image sequence in the training and testing processes was used as input for the object 
annotation stage, while in the production process, the image sequence allowed the MobileNet-SSD to perform 
the person detection process. 
b. Object annotation stage 

The object annotation stage was conducted during the training and testing processes by using the label 
image application. Figures 10-15 show the object annotation results for each class. This step was performed on 
each image to create a bounding box and was given the position and classification information as combatants, 
non-combatants, or civilians. From the object annotation process, the research dataset was collected. 


Figure 10. One of the images of combatants [44]

Figure 11. The process of annotating one of the combatant images

Figure 12. One of the images of non-combatants [45]

Figure 13. The process of annotating one of the non-combatant images

Figure 14. One of the images of civilians [46]

Figure 15. The process of annotating one of the civilian images

From the object annotation stage, the images used as the target dataset were 10,154 combatant images,
10,120 non-combatant images, and 10,118 civilian images, for a total of 30,392 images. The images were then
used as training and validation data for 200 epochs. The distribution of images is as follows: i)
for the combatant images, the training data consisted of 7,154 images, and the validation data consisted of
3,000 images; ii) for the non-combatant images, the training data consisted of 7,120 images, and the validation
data consisted of 3,000 images; and iii) for the civilian images, 7,118 images were used as training data,
and 3,000 images were used as validation data.

c. Image resizing stage

The object annotation stage produced combatant, non-combatant, and civilian images that still varied
in size. To meet the input requirements of the network architectures used in this research
(Inception-ResNet-v2 and Inception-v4), the images were resized to 299x299 pixels.
d. Person detection stage

In the training and testing processes, as shown in Figure 3, person detection was performed through
Inception-ResNet-v2 or Inception-v4 after the input images were obtained. Inception-ResNet-v2 and
Inception-v4 are sub-networks within the siamese network. Meanwhile, in the production process
(see Figure 4), person detection was carried out through MobileNet-SSD.


3.1.2. Feature extraction 

The 299x299-pixel images resulting from preprocessing became the inputs for
the feature extraction stage, in which the images are turned into human features (embeddings) of N dimensions.
Using three-channel (RGB) image input, the feature extraction step produces
N-dimensional vectors. The size of 299x299 pixels is compatible with the network architectures used
in the next process in the system/model design. Figure 16 is an illustration of the N-dimensional vectors of the
anchor, positive, and negative images.

Figure 16. N-dimensional vectors


Electrostatic loss, as the objective function in the system/model design, serves to minimize the vector
distance between matching images and to keep the vectors of non-matching images apart, across anchor
images, positive images, and negative images. The closer the vectors of matching images are,
the smaller the value of the electrostatic loss is, and vice versa. If the total value of the electrostatic loss
decreases and approaches zero, then the system/model design can be categorized as improving. Conversely,
if the total value of the electrostatic loss increases, then the system/model design can be categorized as
inadequate. The vector distance between these images was calculated using the cosine similarity method, which
has been reported to be better than the Euclidean distance method [47].

In this study, a system/model design experiment was conducted using several N-dimensional vectors,
i.e., 32, 64, and 128. The comparison showed that the 128-dimensional
vectors gave the best results. Nevertheless, under the conditions set in this study, the maximum error
that is still allowed is below 5% [48]. This result is discussed further in the sub-section on evaluation metrics
in the next section. Figure 17 shows an example of 128-dimensional vectors.

Figure 17. 128-dimensional vectors


3.1.3. Classification 

Classification is a data analysis process that produces models to describe classes contained in data 
[37]. A system/model design can be created from the classification of training data provided and used on new 
data. It is expected that the system is able to classify all data correctly. Therefore, some adjustments are 
necessary to make the error smaller and closer to zero. In this study, after the human features (embeddings) were
obtained, they were identified and classified into three classes, namely combatants, non-combatants, or
civilians. An input that does not match any of the three classes is categorized as unknown.


3.2. Evaluation metrics analysis of research results 

As explained in the previous section, the methods used to evaluate the system/model design include
PCA, accuracy, mAP, R-precision, AMI, and NMI. The training and validation of the system/model
were conducted for up to 200 epochs under several conditions. In the first step of the experiment, the
dimensionality of the vectors was varied among 32, 64, and 128 dimensions. In the second step, the network
architecture was varied between Inception-ResNet-v2 and Inception-v4. In the last step, the loss function was
varied between triplet loss and electrostatic loss.


3.2.1. Principal component analysis 

The purpose of PCA on a system/model design is to facilitate observations. With PCA, dimensions 
are reduced from 128-dimensional vectors to 3-dimensional vectors. Figures 18-21 demonstrate dimensional 
vectors using PCA. Figure 18 exhibits the result of PCA on Inception ResNet-v2 32-Dimensional vectors. 
Figure 19 shows the result of PCA on Inception ResNet-v2 64-dimensional vectors. Figure 20 explains the 
PCA results on Inception ResNet-v2 128-dimensional vectors. Lastly, Figure 21 indicates the PCA results on 
Inception-v4 128-dimensional vectors. Among the PCA results in different dimensional vectors, the Inception 
ResNet-v2 128-dimensional vectors network architecture had the best result because it had the smallest distance 
between vectors within the same class (interclass) and the largest distance between vectors in different classes 
(intraclass). 
Figure 18. PCA results on Inception ResNet-v2 32- 
dimensional vectors into 3-dimensional vectors 
Figure 19. PCA results on Inception ResNet-v2 64- 
dimensional vectors into 3-dimensional vectors 
Figure 20. PCA results on Inception ResNet-v2 128- 
dimensional vectors into 3-dimensional vectors 
Figure 21. PCA results on Inception-v4 128- 
dimensional vectors into 3-dimensional vectors 
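
The projection behind Figures 18-21 can be sketched as follows. This is an illustrative example only: it assumes PCA computed through a singular value decomposition in NumPy, and it uses synthetic vectors in place of the study's embeddings. The helper names `pca_project` and `mean_class_distances` are hypothetical. The second helper puts numbers on what the plots show visually, namely how far apart vectors of the same class are compared with vectors of different classes.

```python
import numpy as np

def pca_project(embeddings, n_components=3):
    """Project embeddings (one per row) onto their top principal components."""
    X = np.asarray(embeddings, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def mean_class_distances(points, labels):
    """Return (mean intraclass distance, mean interclass distance)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    same = labels[:, None] == labels[None, :]
    off_diagonal = ~np.eye(len(labels), dtype=bool)  # ignore self-distances
    intra = dist[same & off_diagonal].mean()
    inter = dist[~same].mean()
    return float(intra), float(inter)

# Synthetic stand-in for 128-dimensional embeddings of three classes.
rng = np.random.default_rng(0)
class_centers = rng.normal(scale=5.0, size=(3, 128))
labels = np.repeat(np.arange(3), 20)
embeddings = class_centers[labels] + rng.normal(scale=0.5, size=(60, 128))

points_3d = pca_project(embeddings)
intra, inter = mean_class_distances(points_3d, labels)
print(points_3d.shape, intra < inter)
```

A well-trained embedding should give a small first value and a large second value, which is the criterion used above to compare the four configurations.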


3.2.2. Accuracy, mean average precision, R-precision, adjusted mutual information, and normalized 
mutual information analyses 


Accuracy, mAP, and R-precision are used to measure how accurately the system classifies the data, 
while AMI and NMI are used to measure the quality of the class clusters produced by the system/model 
design, as well as the interclass and intraclass vector distances. Three analyses were carried out, 
comparing different dimensional vectors, different network architectures, and different loss functions, 
respectively. 
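
The text does not state how these metrics were computed, so the following is only a sketch under stated assumptions. It assumes accuracy is measured as precision at rank 1 in a nearest-neighbour search over the embeddings, that R-precision and mAP follow their usual retrieval definitions, and that NMI is normalized by the arithmetic mean of the two entropies. The function names and toy data are hypothetical. AMI additionally corrects NMI-style scores for chance agreement and is not implemented here.

```python
import numpy as np

def retrieval_metrics(embeddings, labels):
    """Precision@1, R-precision and mAP from nearest-neighbour retrieval."""
    X = np.asarray(embeddings, dtype=float)
    y = np.asarray(labels)
    n = len(y)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a query must not retrieve itself
    p_at_1, r_precision, average_precision = [], [], []
    for i in range(n):
        ranked = np.argsort(dist[i])[: n - 1]  # the query itself sorts last
        relevant = (y[ranked] == y[i]).astype(float)
        r = int(relevant.sum())  # other samples sharing the query's class
        if r == 0:
            continue  # nothing to retrieve for this query
        p_at_1.append(relevant[0])
        r_precision.append(relevant[:r].mean())
        precision_at_k = np.cumsum(relevant) / np.arange(1, n)
        average_precision.append((precision_at_k * relevant).sum() / r)
    return {
        "precision_at_1": float(np.mean(p_at_1)),
        "r_precision": float(np.mean(r_precision)),
        "mAP": float(np.mean(average_precision)),
    }

def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def nmi(labels_a, labels_b):
    """Mutual information normalized by the mean of the two entropies."""
    _, a_idx = np.unique(np.asarray(labels_a), return_inverse=True)
    _, b_idx = np.unique(np.asarray(labels_b), return_inverse=True)
    contingency = np.zeros((a_idx.max() + 1, b_idx.max() + 1))
    np.add.at(contingency, (a_idx, b_idx), 1)
    joint = contingency / contingency.sum()
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    nonzero = joint > 0
    mi = (joint[nonzero] * np.log(joint[nonzero] / (pa @ pb)[nonzero])).sum()
    mean_entropy = 0.5 * (entropy(labels_a) + entropy(labels_b))
    return 1.0 if mean_entropy == 0 else float(mi / mean_entropy)

# Toy data: two well-separated classes (made-up values).
toy_embeddings = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
toy_labels = [0, 0, 0, 1, 1, 1]
toy_clusters = [1, 1, 1, 0, 0, 0]  # assumed cluster assignments, e.g. from k-means
print(retrieval_metrics(toy_embeddings, toy_labels))
print(nmi(toy_labels, toy_clusters))
```

The clustering metrics compare the true labels with cluster assignments, so they are insensitive to how the clusters are numbered; only the grouping matters.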
a. Analysis of different dimensional vectors 

With different dimensional vectors (32, 64, and 128) and a fixed network architecture, Inception 
ResNet-v2, the system/model design trials produced different accuracy values (see Table 2). Table 2 
shows that the highest values were produced by the 128-dimensional-vector system (accuracy=0.994681, 
mAP=0.994385, R-precision=0.992908, AMI=0.964917, and NMI=0.965031). These results indicate that 
the larger the dimensional vectors, the greater the iteration time and the better the results, as was also 
observed in previous research by Schroff et al. [11]. 


Table 2. Accuracy, mAP, R-precision, AMI, and NMI values obtained using electrostatic loss with 
32-, 64-, and 128-dimensional vectors 
Name                                          Accuracy  mAP       R-precision  AMI       NMI 
Electrostatic loss [32] Inception ResNet-v2   0.875887  0.883422  0.880024     0.650689  0.651830 
Electrostatic loss [64] Inception ResNet-v2   0.907801  0.915632  0.906028     0.686539  0.687562 
Electrostatic loss [128] Inception ResNet-v2  0.994681  0.994385  0.992908     0.964917  0.965031 


b. Analysis of different network architectures 

In the second experiment, the Inception ResNet-v2 and Inception-v4 network architectures were used 
with electrostatic loss. Inception ResNet-v2 is the most recent network architecture developed by 
Szegedy et al. [33] from their earlier versions. In addition, both Inception ResNet-v2 and Inception-v4 have 
an accuracy of more than 80% at an operating cost of fewer than 15 G-FLOPs. Several other network 
architectures also have an accuracy of more than 80% (NASNet-A-Large and SENet-154), but they were not used 
because their operating cost exceeds 20 G-FLOPs and their memory usage is greater than that of Inception 
ResNet-v2 and Inception-v4 [49]. Table 3 presents the results of the system/model design trials with these 
settings. It shows that the Inception ResNet-v2 network architecture produces the highest values 
(accuracy=0.994681, mAP=0.994385, R-precision=0.992908, AMI=0.964917, and NMI=0.965031). 


Table 3. Accuracy, mAP, R-precision, AMI, and NMI values of the system/model designs using 
electrostatic loss with Inception ResNet-v2 and Inception-v4 
Name                                          Accuracy  mAP       R-precision  AMI       NMI 
Electrostatic loss [128] Inception-v4         0.898936  0.910757  0.908983     0.756541  0.757332 
Electrostatic loss [128] Inception ResNet-v2  0.994681  0.994385  0.992908     0.964917  0.965031 


c. Analysis of different loss functions 

In designing the system/model, this study employed two different loss functions, namely triplet loss and 
electrostatic loss. Table 4 shows the results of the system/model design trials with each loss function. 
The system/model design trained with electrostatic loss produced higher values than the one trained with 
triplet loss (accuracy=0.994681, mAP=0.994385, R-precision=0.992908, AMI=0.964917, and 
NMI=0.965031). Electrostatic loss is developed from triplet loss by adding a term for the 
distance between the positive and negative images. In addition, Coulomb's law was added to reduce the 
intraclass distance and increase the interclass distance according to (5). 
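
For reference, the baseline that electrostatic loss extends can be sketched as follows. This is the standard triplet loss on squared Euclidean distances, as formulated by Schroff et al. [11]; the additional positive-negative and Coulomb terms of (5) are not reproduced here. The function name `triplet_loss`, the default margin of 0.2, and the toy vectors are illustrative assumptions rather than the settings used in this study.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Mean hinge loss max(d(a, p) - d(a, n) + margin, 0) over a batch.

    d is the squared Euclidean distance between embeddings; each argument
    is an array with one embedding per row.
    """
    anchor = np.asarray(anchor, dtype=float)
    positive = np.asarray(positive, dtype=float)
    negative = np.asarray(negative, dtype=float)
    d_ap = ((anchor - positive) ** 2).sum(axis=-1)
    d_an = ((anchor - negative) ** 2).sum(axis=-1)
    return float(np.maximum(d_ap - d_an + margin, 0.0).mean())

# Toy 2-D embeddings (made-up values).
a = [[0.0, 0.0]]
near = [[0.0, 0.1]]
far = [[1.0, 0.0]]
print(triplet_loss(a, near, far))  # positive is close, negative is far
print(triplet_loss(a, far, near))  # roles swapped: the triplet is violated
```

The loss vanishes once the negative is farther from the anchor than the positive by at least the margin, so such triplets stop contributing to training; this is the behaviour that additional terms between the positive and negative images are meant to complement.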


Table 4. Accuracy, mAP, R-precision, AMI, and NMI values of the system/model designs using the 
[128] Inception ResNet-v2 network architecture with triplet loss and electrostatic loss 
Name                                          Accuracy  mAP       R-precision  AMI       NMI 
Triplet loss [128] Inception ResNet-v2        0.918440  0.921099  0.916667     0.738010  0.738868 
Electrostatic loss [128] Inception ResNet-v2  0.994681  0.994385  0.992908     0.964917  0.965031 


4. CONCLUSION 

Based on the analysis of the results, this study draws the following conclusions. Regarding detection and 
classification ability, electrostatic loss produces better accuracy than triplet loss. The electrostatic force based 
on Coulomb's law pushes f(xn) away from f(xa) and f(xp), which prevents the vector distance 
between f(xa) and f(xn) from becoming smaller than that between f(xa) and f(xp). Applying Coulomb's law in 
electrostatic loss requires adapting its rules to images, which are treated as analogous to charged particles. 
With electrostatic loss, Inception ResNet-v2 with 128-dimensional vectors yields the best results among the 
tested configurations. Electrostatic loss can be used as an alternative loss function in deep metric learning. 
The results of this study still warrant further investigation. Future research may use electrostatic loss 
to design a system/model with more classes, more datasets, and dimensional vectors larger than 128, 
and may use equipment with a higher capability than the equipment used in this study. 